{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Matching Base Pairs from a list of research papers to their respective bin locations and filtering for the smallest p-value\n",
    "\n",
    "Two datasets are provided containing the relevant data. The first, contained within the file *papers_bin_pvalues.csv*, is the a list of papers with chromosomes and base pairs investigated and a corresponding p-value for the hypothesis tested. The second, contained in the file *chromosome_bins.csv* is a list of bins with their corresponding chromosome, start base pair and end base pair."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import the data containing the combinations of chromosomes and their bin starting and ending base pairs. \n",
    "\n",
    "The datatypes are defined so that Panda's autodetect does not identify the `bin` values as float."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 240 entries, 0 to 239\n",
      "Data columns (total 4 columns):\n",
      "chromosome    240 non-null object\n",
      "start         240 non-null int64\n",
      "end           240 non-null int64\n",
      "bin           240 non-null object\n",
      "dtypes: int64(2), object(2)\n",
      "memory usage: 7.6+ KB\n"
     ]
    }
   ],
   "source": [
    "bins = pd.read_csv(\n",
    "    'chromosome_bins.csv',\n",
    "    dtype={\n",
    "        \"chromosome\": object,\n",
    "        \"start\": int,\n",
    "        \"end\": int,\n",
    "        \"bin\": object\n",
    "    }\n",
    ")\n",
    "bins.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chromosome</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1</td>\n",
       "      <td>6902277</td>\n",
       "      <td>1.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1</td>\n",
       "      <td>6902278</td>\n",
       "      <td>11404933</td>\n",
       "      <td>1.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1</td>\n",
       "      <td>11404934</td>\n",
       "      <td>18107097</td>\n",
       "      <td>1.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1</td>\n",
       "      <td>18107098</td>\n",
       "      <td>30481074</td>\n",
       "      <td>1.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1</td>\n",
       "      <td>30481075</td>\n",
       "      <td>40606936</td>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  chromosome     start       end  bin\n",
       "0       chr1         1   6902277  1.1\n",
       "1       chr1   6902278  11404933  1.2\n",
       "2       chr1  11404934  18107097  1.3\n",
       "3       chr1  18107098  30481074  1.4\n",
       "4       chr1  30481075  40606936  1.5"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "chromosome    0\n",
       "start         0\n",
       "end           0\n",
       "bin           0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are no `null` values. All rows contain data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import the list of papers and their respective information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 191375 entries, 0 to 191374\n",
      "Data columns (total 7 columns):\n",
      "author       191375 non-null object\n",
      "author_id    191375 non-null int64\n",
      "chr_id       191375 non-null object\n",
      "bp           191375 non-null int64\n",
      "snp          186639 non-null object\n",
      "pvalue       190473 non-null float64\n",
      "bin          0 non-null object\n",
      "dtypes: float64(1), int64(2), object(4)\n",
      "memory usage: 10.2+ MB\n"
     ]
    }
   ],
   "source": [
    "papers = pd.read_csv(\n",
    "    'paper_bin_pavalue.csv', \n",
    "    dtype={\n",
    "        \"author\": object,\n",
    "        \"author_id\": int,\n",
    "        \"chr_id\": object,\n",
    "        \"bp\": int,\n",
    "        \"snp\": object,\n",
    "        \"pvalue\": float,\n",
    "        \"bin\": object\n",
    "    }\n",
    ")\n",
    "papers.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>author_id</th>\n",
       "      <th>chr_id</th>\n",
       "      <th>bp</th>\n",
       "      <th>snp</th>\n",
       "      <th>pvalue</th>\n",
       "      <th>bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bandres-Ciga</td>\n",
       "      <td>1</td>\n",
       "      <td>chr2</td>\n",
       "      <td>135539967</td>\n",
       "      <td>rs6430538</td>\n",
       "      <td>0.02195</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bandres-Ciga</td>\n",
       "      <td>1</td>\n",
       "      <td>chr17</td>\n",
       "      <td>43994648</td>\n",
       "      <td>rs17649553</td>\n",
       "      <td>0.02547</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Bandres-Ciga</td>\n",
       "      <td>1</td>\n",
       "      <td>chr2</td>\n",
       "      <td>169129145</td>\n",
       "      <td>rs 1955337</td>\n",
       "      <td>0.03442</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Bandres-Ciga</td>\n",
       "      <td>1</td>\n",
       "      <td>chr11</td>\n",
       "      <td>133765367</td>\n",
       "      <td>rs329648</td>\n",
       "      <td>0.03458</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bandres-Ciga</td>\n",
       "      <td>1</td>\n",
       "      <td>chr17</td>\n",
       "      <td>17715101</td>\n",
       "      <td>rs11868035</td>\n",
       "      <td>0.04570</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         author  author_id chr_id         bp         snp   pvalue  bin\n",
       "0  Bandres-Ciga          1   chr2  135539967   rs6430538  0.02195  NaN\n",
       "1  Bandres-Ciga          1  chr17   43994648  rs17649553  0.02547  NaN\n",
       "2  Bandres-Ciga          1   chr2  169129145  rs 1955337  0.03442  NaN\n",
       "3  Bandres-Ciga          1  chr11  133765367    rs329648  0.03458  NaN\n",
       "4  Bandres-Ciga          1  chr17   17715101  rs11868035  0.04570  NaN"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "papers.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "author            0\n",
       "author_id         0\n",
       "chr_id            0\n",
       "bp                0\n",
       "snp            4736\n",
       "pvalue          902\n",
       "bin          191375\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "papers.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some null values for `snp` and `pvalue`. The `bin` column is intentionaly empty.\n",
    "The dataframe will copied and the empty entries will droped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_papers = papers.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "papers.dropna(subset=['snp', 'pvalue'], inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "author            0\n",
       "author_id         0\n",
       "chr_id            0\n",
       "bp                0\n",
       "snp               0\n",
       "pvalue            0\n",
       "bin          185832\n",
       "dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "papers.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are now 185832 entries down from 191375 for a reduction of 3% (5543) entries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching the two datasets\n",
    "\n",
    "The bin will be calculated from the `bins` dataframe comparing the start and end base pairs with the base pair of the each paper. The matching is done when the chromosome entries are equal **and** when the bin's *start base pair* is __less than or equal__ to the base pair of the paper **and** when the bin's *end base pair* is __greater than or equal__ to the base pair of the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4min 34s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit -r 1\n",
    "for idx, row in papers.iterrows():\n",
    "    bin_value = bins[\n",
    "        bins.chromosome.eq(row.chr_id) & \n",
    "        bins.start.le(row.bp) & \n",
    "        bins.end.ge(row.bp)\n",
    "    ].bin.values\n",
    "    papers.at[idx, 'bin'] = bin_value[0] if bin_value.size > 0 else None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the now populated `bin` column of the `papers` dataframe it is ease to retrieve only the min values for `pvalue` for each **bin** of each **author**. The resulting dataframe is stored in a new dataframe named `combinations`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "combinations = papers.loc[papers.groupby(['author_id', 'bin']).pvalue.idxmin()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 703 entries, 8 to 114735\n",
      "Data columns (total 7 columns):\n",
      "author       703 non-null object\n",
      "author_id    703 non-null int64\n",
      "chr_id       703 non-null object\n",
      "bp           703 non-null int64\n",
      "snp          703 non-null object\n",
      "pvalue       703 non-null float64\n",
      "bin          703 non-null object\n",
      "dtypes: float64(1), int64(2), object(4)\n",
      "memory usage: 43.9+ KB\n"
     ]
    }
   ],
   "source": [
    "combinations.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>author_id</th>\n",
       "      <th>chr_id</th>\n",
       "      <th>bp</th>\n",
       "      <th>snp</th>\n",
       "      <th>pvalue</th>\n",
       "      <th>bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>41201</th>\n",
       "      <td>Maraganore DM</td>\n",
       "      <td>19</td>\n",
       "      <td>chr3</td>\n",
       "      <td>110453390</td>\n",
       "      <td>2699976</td>\n",
       "      <td>2.348000e-03</td>\n",
       "      <td>3.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Bandres-Ciga</td>\n",
       "      <td>1</td>\n",
       "      <td>chr7</td>\n",
       "      <td>23293746</td>\n",
       "      <td>rs199347</td>\n",
       "      <td>3.881000e-01</td>\n",
       "      <td>7.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1267</th>\n",
       "      <td>Spencer CC</td>\n",
       "      <td>17</td>\n",
       "      <td>chr5</td>\n",
       "      <td>31270987</td>\n",
       "      <td>rs4457092</td>\n",
       "      <td>4.050000e-06</td>\n",
       "      <td>5.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>798</th>\n",
       "      <td>Pickrell JK</td>\n",
       "      <td>12</td>\n",
       "      <td>chr7</td>\n",
       "      <td>23084258</td>\n",
       "      <td>rs10256359</td>\n",
       "      <td>2.000000e-12</td>\n",
       "      <td>7.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>556</th>\n",
       "      <td>Foo JN</td>\n",
       "      <td>6</td>\n",
       "      <td>chr21</td>\n",
       "      <td>19566451</td>\n",
       "      <td>rs2824703</td>\n",
       "      <td>8.610000e-05</td>\n",
       "      <td>21.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              author  author_id chr_id         bp         snp        pvalue  \\\n",
       "41201  Maraganore DM         19   chr3  110453390     2699976  2.348000e-03   \n",
       "12      Bandres-Ciga          1   chr7   23293746    rs199347  3.881000e-01   \n",
       "1267      Spencer CC         17   chr5   31270987   rs4457092  4.050000e-06   \n",
       "798      Pickrell JK         12   chr7   23084258  rs10256359  2.000000e-12   \n",
       "556           Foo JN          6  chr21   19566451   rs2824703  8.610000e-05   \n",
       "\n",
       "        bin  \n",
       "41201   3.9  \n",
       "12      7.3  \n",
       "1267    5.3  \n",
       "798     7.3  \n",
       "556    21.1  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combinations.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataframe is not re-index so that a reference to the initial dataframe is kept. It will make lookups easier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Storing the results\n",
    "\n",
    "The resulting filtered by minimum p-value matches are stored in a new file name *matches.csv* for further processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "combinations.to_csv('matches.csv')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "latex_metadata": {
   "author": "Arys E. Andreou",
   "title": "Matching Base Pairs to Chromosome Bins"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
